27 May, 2021

The road to becoming a data scientist

Course layout

  1. Data (geo)science project

  2. Exploring your data

  3. Modelling

  4. Wrangling

Each section:
5–10 minutes of introduction
5–10 minutes of live coding and questions

PAGES R package

R for Data Science (R4DS)

This class is based entirely on Hadley Wickham and Garrett Grolemund’s R4DS.

I have augmented the examples with cases from geology.

Wickham and Grolemund (2016)

Tidyverse

The tidyverse: an opinionated collection of R packages designed for data science

  • ggplot2: A modular graphing tool for custom solutions
  • dplyr: Getting data into the right format
  • readr: Opinionated import tools for rectangular data
  • tidyr: Cast data into a consistent tidy format

The datasets

  • In stratigraphic work we look at variations in the value of a variable through time (a time series).
  • Time is recorded as height or depth in the section and can sometimes be calibrated to age (age model).
  • Uniformitarian principles of geology.
  • The R package PAGES provides the lazy-loaded datasets bonenburg (geochemistry) and kuhjoch (palynology).

Data (geo)science project

Wickham and Grolemund (2016)

Data (geo)science project

  1. Beginner: Emphasis on R objects and output
  2. Expert: Emphasis on scripts and functions

RStudio projects

Reproducible workflows

  1. A clear directory and file structure, with metadata to describe the data. Raw data should be read-only and backed up.

  2. An R script with clear documentation of all steps involved.

  3. Publish all aspects of this workflow along with your paper.
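
A minimal sketch of the first point, using base R only. The folder names (data-raw, data, R, output), the file names and the values are hypothetical choices for illustration and do not come from the course material. The project is created in a temporary directory so the example can be run safely.

```r
# Hypothetical project skeleton; folder names are illustrative assumptions
project <- file.path(tempdir(), "my_geoscience_project")
folders <- file.path(project, c("data-raw", "data", "R", "output"))
for (folder in folders) {
  dir.create(folder, recursive = TRUE, showWarnings = FALSE)
}

# Store a (made-up) raw data file together with a short metadata description
raw_file <- file.path(project, "data-raw", "section_raw.csv")
write.csv(
  data.frame(height = c(0.5, 1.0), value = c(-27.5, -25.5)),
  raw_file,
  row.names = FALSE
)
writeLines(
  "height: metres above base; value: measured variable",
  file.path(project, "data-raw", "README.txt")
)

# Make the raw data read-only so that scripts cannot overwrite it by accident
Sys.chmod(raw_file, mode = "0444")
```

An RStudio project placed in the top-level folder keeps all paths relative to that folder, which makes the workflow portable.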

Pipes

Pipe: %>%

  • intermediate steps
kuhjoch_grps <- group_by(kuhjoch, type)
kuhjoch_mean <- summarise(kuhjoch_grps, mean_count = mean(count))
  • function composition
kuhjoch_mean <- summarise(group_by(kuhjoch, type), mean_count = mean(count))
  • pipe
kuhjoch_mean <- group_by(kuhjoch, type) %>%
  summarise(mean_count = mean(count))
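
The three styles can be compared on a small made-up table. This is a sketch only: the table `toy` and its values are invented stand-ins, not the real kuhjoch data, and the summary statistic (a group mean) is an assumption about what the slide intends.

```r
library(dplyr)

# Made-up counts standing in for a palynological dataset
toy <- tibble(
  type  = c("pollen", "pollen", "spore", "spore"),
  count = c(10, 20, 4, 6)
)

# 1. intermediate steps: every step is stored in its own object
toy_grps  <- group_by(toy, type)
mean_step <- summarise(toy_grps, mean_count = mean(count))

# 2. function composition: read from the inside out
mean_nested <- summarise(group_by(toy, type), mean_count = mean(count))

# 3. pipe: read from top to bottom
mean_pipe <- toy %>%
  group_by(type) %>%
  summarise(mean_count = mean(count))

# All three approaches give the same table
all.equal(mean_step, mean_pipe)
```

The pipe avoids both the clutter of intermediate objects and the inside-out reading order of nested calls.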

Break 1

Exploratory data analyses

Wickham and Grolemund (2016)

Grammar of graphics

“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey



ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(
     mapping = aes(<MAPPINGS>),
     stat = <STAT>, 
     position = <POSITION>,
     orientation = <ORIENTATION>
  ) +
  <FACET_FUNCTION>

Grammar of graphics (Wilkinson et al. 2005)

Aesthetics mapping

Scatterplot: maps each observation to a horizontal and vertical position and the geom represents this as a point

ggplot(data = bonenburg) +
  geom_point(mapping = aes(x = del13Ctoc, y = height)) 

Aesthetics mapping

The colour, shape and linetype can also be used to map additional variables. Here I use the stratigraphy (categorical) as an additional variable.

ggplot(data = bonenburg) +
  geom_point(mapping = aes(x = del13Ctoc, y = height, colour = strat))

Statistical transformations and Facets

Internal (statistical) transformation of <DATA>.

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(
    mapping = aes(<MAPPINGS>),
    stat = <STAT>,
    position = <POSITION>,
    orientation = <ORIENTATION>
    ) +
  <FACET_FUNCTION>

Visualizing distributions

Typical questions

  1. Which values are common?
  2. Which values are rare?
  3. And can we discern patterns in the previous examples?

Wickham and Grolemund (2016)

Visualizing distributions

ggplot(data = bonenburg) + 
  geom_boxplot(mapping = aes(y = strat, x = del13Ctoc), stat = "boxplot")

Visualizing distributions

ggplot(data = bonenburg) +
  geom_boxplot(mapping = aes(y = reorder(strat, height), x = del13Ctoc))

Facets

Facets split the data according to a categorical variable.

ggplot(data = bonenburg_long) +
  geom_point(mapping = aes(x = value, y = height)) +
  facet_grid(cols = vars(measurement), scales = "free_x") 

Break 2

Patterns and models

  • Discern patterns (or signals) from noise.

  • Exploration, not confirmation or formal inference!

Wickham and Grolemund (2016)

Co-variation continuous–continuous

ggplot(data = bonenburg_cross, mapping = aes(x = value, y = del13Ctoc)) +
  geom_point(aes(colour = strat)) +
  facet_wrap(facets = vars(measurement), scales = "free") 

Co-variation continuous–continuous

ggplot(data = bonenburg_cross, mapping = aes(x = value, y = del13Ctoc)) +
  geom_point(aes(colour = strat)) +
  geom_smooth() +
  facet_wrap(facets = vars(measurement), scales = "free") 

Transforming variables

ggplot(
  bonenburg, 
  aes(x = TOCcfb, y = del13Ctoc)
  ) +
  geom_point(aes(colour = strat)) +
  geom_smooth(method = "lm") 

lm(del13Ctoc ~ TOCcfb, bonenburg)

ggplot(
  bonenburg, 
  aes(x = log(TOCcfb), y = del13Ctoc)
  ) +
  geom_point(aes(colour = strat)) +
  geom_smooth(method = "lm") 

lm(del13Ctoc ~ log(TOCcfb), bonenburg)

Regression models

This was a very simple, exploratory analysis of the data.

Fitting models:

  • additive components and interactions (multivariate regression)
  • residual diagnostics to validate the assumptions of the model
  • correlative structures (time series have a correlation between time steps)
  • grouping structures (mixed effect models)
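
As a minimal sketch of the residual-diagnostics point, the example below fits two linear models to simulated data and compares them. The data are generated with a known logarithmic relationship purely for illustration; they are not the bonenburg measurements, and the coefficients are arbitrary.

```r
# Simulated data with a logarithmic relationship (made up for illustration)
set.seed(1)
toy <- data.frame(x = seq(0.2, 3, length.out = 50))
toy$y <- -27 + 1.5 * log(toy$x) + rnorm(50, sd = 0.2)

# Fit a linear model on the raw and on the log-transformed predictor
fit_raw <- lm(y ~ x, data = toy)
fit_log <- lm(y ~ log(x), data = toy)

# Residual standard error of both fits: smaller means a tighter fit
sigma(fit_raw)
sigma(fit_log)

# Residuals against fitted values: look for remaining curvature or trends
plot(fitted(fit_log), residuals(fit_log))
abline(h = 0, lty = 2)
```

A curved pattern in the residual plot of `fit_raw` would indicate that the assumption of linearity is violated, which is what the transformation tries to address.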

Further reading:

  • Peter Dalgaard 2008 Introduction to statistics with R
  • John Fox & Sanford Weisberg 2018 An R companion to applied regression
  • Alain Zuur et al. 2008 Mixed Effects Models and Extensions in Ecology with R

Break 3

Wrangling

Wickham and Grolemund (2016)

Reading and writing data

Load your data into R with the readr package

  • read_csv(): comma separated (CSV) files
  • read_tsv(): tab separated files
  • read_delim(): general delimited files

Description on website: “In many cases, these functions will just work!”

Conversely, you can write data back to several file formats with the write_*() functions.
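
A small round trip illustrates reading and writing with readr. This is a sketch on an invented two-column table written to a temporary file; the column names merely mimic the course data.

```r
library(readr)

# Made-up measurements, including a missing value
toy <- tibble::tibble(
  height = c(3.01, 3.56, 3.95),
  CaCO3  = c(13.3, NA, 3.84)
)

# Write to a temporary CSV file and read it back
path <- tempfile(fileext = ".csv")
write_csv(toy, path)
toy_back <- read_csv(path, col_types = cols(.default = col_double()))

# Values and missing entries survive the round trip
all.equal(toy$height, toy_back$height)
is.na(toy_back$CaCO3)
```

Stating `col_types` explicitly documents what you expect from the file and silences the column-guessing message.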

Reading and writing data

PAGES_example()
## [1] "bonenburg_raw.csv" "kuhjoch_raw.csv"
read_csv(PAGES_example("bonenburg_raw.csv"))
## # A tibble: 108 x 13
##    SampleID Height CaCO3    TN del13Ctoc TOCcfb `Al2O3 (%)` `Na2O (%)` `K2O (%)`
##       <dbl>  <dbl> <dbl> <dbl>     <dbl>  <dbl>       <dbl>      <dbl>     <dbl>
##  1        0   3.01 13.3   0.06     -27.5   1.16        15.6       0.62      3.25
##  2       60   3.56  2.67 NA        -25.5   0.27        13.1       1.27      3.33
##  3      100   3.95  3.84  0.07     -27.3   0.96        17.4       0.55      3.54
##  4      150   4.43  5.86  0.07     -27     1.25        17.6       0.44      3.79
##  5      200   4.94 12.8   0.07     -27.8   1.52        16.4       0.48      3.73
##  6      250   5.25  3.34  0.09     -27.6   2.45        14.6       0.61      3.42
##  7      275   5.68  9.91  0.06     -27     1.19        17.3       0.44      4.18
##  8      300   5.92 NA    NA         NA    NA           15.5       0.46      3.74
##  9      300   6.16 22.2   0.06     -27.1   1.21        NA        NA        NA   
## 10      350   6.41 20.5   0.06     -27.5   1.14        15.1       0.51      3.92
## # … with 98 more rows, and 4 more variables: Strat <chr>, Strat2 <chr>,
## #   Section <chr>, Reference <chr>

Tidy data

There are three interrelated rules which make a dataset tidy:

  • Each variable must have its own column.
  • Each observation must have its own row.
  • Each value must have its own cell.

Long and wide format
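
A minimal sketch of switching between the two formats with tidyr. The table is made up; the column names `measurement` and `value` follow those used with bonenburg_long in the plotting examples, but it is an assumption that the packaged long-format data were built exactly this way.

```r
library(tidyr)

# Made-up wide table: one column per measured variable
wide <- tibble::tibble(
  height    = c(3.0, 3.5),
  del13Ctoc = c(-27.5, -25.5),
  TOCcfb    = c(1.16, 0.27)
)

# Wide to long: one row per combination of height and measurement
long <- pivot_longer(
  wide,
  cols      = c(del13Ctoc, TOCcfb),
  names_to  = "measurement",
  values_to = "value"
)

# Long back to wide
wide_again <- pivot_wider(long, names_from = measurement, values_from = value)
```

The long format is convenient for faceted plots, because the variable name becomes a column that can be passed to the facet function.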

Transforming

Create, rename, reorder, and summarise variables with dplyr from the tidyverse.

  • mutate() e.g., K/Al from K and Al
  • select() e.g., pick height and K/Al
  • filter() e.g., all observations above 3 meters
  • summarise() e.g., combined with group_by() calculate mean value of K/Al for lithological units
  • arrange() e.g., sort by ascending height

Grouping: group_by() or rowwise()
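
The verbs can be chained into a single pipeline. The sketch below uses an invented table with hypothetical column names (`K`, `Al`, `strat`), not the real bonenburg data.

```r
library(dplyr)

# Made-up element concentrations (%) for two lithological units
toy <- tibble(
  height = c(2.5, 3.5, 4.0, 5.0),
  strat  = c("A", "A", "B", "B"),
  K      = c(2.0, 3.0, 2.4, 3.6),
  Al     = c(8.0, 10.0, 8.0, 9.0)
)

toy_summary <- toy %>%
  mutate(K_Al = K / Al) %>%               # create K/Al from K and Al
  select(height, strat, K_Al) %>%         # keep only the columns needed
  filter(height > 3) %>%                  # observations above 3 metres
  group_by(strat) %>%                     # one group per lithological unit
  summarise(mean_K_Al = mean(K_Al)) %>%   # mean K/Al per unit
  arrange(mean_K_Al)                      # sort ascending by the mean

toy_summary
```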

Transforming variables

XRF oxides and normalization with elemental ratios

mutate(
  bonenburg_tidy,
  # oxide correction: mass fraction of the element in its oxide
  Al_pc = Al2O3_pc * with(marelac::atomicweight, 2 * Al / (2 * Al + 3 * O)),
  Na_pc = Na2O_pc * with(marelac::atomicweight, 2 * Na / (2 * Na + O)),
  K_pc = K2O_pc * with(marelac::atomicweight, 2 * K / (2 * K + O)),
  .keep = "unused"
  ) %>%
  # normalization with Al and rename
  mutate(
    across(c(Na_pc, K_pc), ~.x / Al_pc, .names = "{gsub(\"pc\", \"\", .col)}Al"),
    .keep = "unused"
    )
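
The oxide correction can be checked by hand. The sketch below hard-codes rounded standard atomic weights instead of using marelac, and assumes the usual oxide stoichiometry (Al2O3, Na2O, K2O). The two oxide percentages are taken from the first row of the printed raw data.

```r
# Rounded standard atomic weights (g/mol), hard-coded for this sketch
aw <- c(Al = 26.98, Na = 22.99, K = 39.10, O = 16.00)

# Mass fraction of the element in its oxide
f_Al <- unname(2 * aw["Al"] / (2 * aw["Al"] + 3 * aw["O"]))  # Al2O3
f_Na <- unname(2 * aw["Na"] / (2 * aw["Na"] + aw["O"]))      # Na2O
f_K  <- unname(2 * aw["K"]  / (2 * aw["K"]  + aw["O"]))      # K2O

# Convert oxide percentages to element percentages (first sample)
Al_pc <- 15.6 * f_Al
K_pc  <- 3.25 * f_K

# Aluminium-normalized potassium
K_pc / Al_pc
```

Each factor is the mass of the element in one formula unit divided by the mass of the whole formula unit, so it always lies between 0 and 1.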

Break 4

PAGES package

Data
Lazy-loaded data: bonenburg and kuhjoch, as well as the long formats bonenburg_long and kuhjoch_long

Raw data
PAGES_example()

Examples
- project: vignette("project", package = "PAGES")
- explore: vignette("explore", package = "PAGES")
- model: vignette("model", package = "PAGES")
- wrangle: vignette("wrangle", package = "PAGES")

Slides
render_slides()

References

Dalgaard, Peter. 2008. Introduction to statistics with R. Edited by J Chambers, D Hand, and W. Hardle. Springer. https://doi.org/10.1201/9780429341830-12.

Fox, John, and Sanford Weisberg. 2018. An R companion to applied regression. Sage publications.

Schobben, Martin, Julia Gravendyck, Franziska Mangels, Ulrich Struck, Robert Bussert, Wolfram M. Kürschner, Dieter Korn, P. Martin Sander, and Martin Aberhan. 2019. A comparative study of total organic carbon-\(\delta\)13C signatures in the Triassic–Jurassic transitional beds of the Central European Basin and western Tethys shelf seas. Newsletters on Stratigraphy 52 (4): 461–86. https://doi.org/10.1127/nos/2019/0499.

Wickham, Hadley, and Garrett Grolemund. 2016. R for data science: import, tidy, transform, visualize, and model data. O’Reilly Media, Inc. https://r4ds.had.co.nz/index.html.

Wilkinson, Leland, Graham Wills, D Rope, Andrew Norton, and Roger Dubbs. 2005. The Grammar of Graphics (Statistics and Computing).

Zuur, Alain F., Elena N. Ieno, Neil J. Walker, Anatoly A. Saveliev, and Graham M. Smith. 2008. Mixed Effects Models and Extensions in Ecology with R. https://doi.org/10.4324/9780429201271-2.